Abstract

Recent works have validated the possibility of improving energy efficiency in radio access
networks (RANs) by dynamically turning some base stations (BSs) on and off. In this paper, we
extend the research on BS switching operations, which should match traffic load variations.
However, instead of depending on predicted traffic loads, which remain quite challenging to
forecast precisely, we first formulate the traffic variation as a Markov decision process (MDP).
Afterwards, in order to foresightedly minimize the energy consumption of the RAN, we adopt the
actor-critic algorithm and design a BS switching operation scheme based on a reinforcement
learning framework. Furthermore, to mitigate the curse of dimensionality in reinforcement
learning, we propose a transfer actor-critic algorithm (TACT), which utilizes the transferred
learning expertise from neighboring regions or historical periods. The proposed TACT algorithm
provably converges and contributes to a performance jumpstart. Finally, we evaluate the proposed
scheme through extensive simulations under various practical configurations and demonstrate the
feasibility of significant energy efficiency improvement.

Index Terms 

radio access networks, base stations, sleeping mode, green communications, energy saving,
reinforcement learning, transfer learning, actor-critic algorithm

I. Introduction 

The explosive popularity of smartphones and tablets has ignited a surging traffic load demand
for radio access and has been incurring massive energy consumption and huge greenhouse gas
(GHG) emissions [1], [2]. Specifically, the information and communication technologies
(ICT) industry accounts for 2% to 10% of the world's overall power consumption [3] and has
emerged as one of the major contributors to worldwide CO2 emissions. Besides, cellular network
operators also face economic pressure to reduce the power consumption of their networks. It is
envisioned that the power bill of China Mobile will double in five years [4]. Meanwhile, energy
expenditure accounts for a significant proportion of the overall cost. Therefore, it is
essential to improve the energy efficiency of the ICT industry.

Currently, over 80% of the power consumption takes place in the radio access networks (RANs),
especially the base stations (BSs) [5]. This is largely because present BS deployment is based
on peak traffic loads and BSs generally stay active irrespective of the traffic load [6], even
though traffic loads in practice vary heavily [7]. Recently, there has been a substantial body
of work on traffic load-aware BS adaptation [8], and the authors have validated the possibility
of energy efficiency improvement from different perspectives. Luca Chiaraviglio et al. [9]
showed the possibility of energy saving by simulations. [10] and [11] proposed how to
dynamically adjust the working status of BSs, depending on the predicted traffic loads. However,
reliably predicting the traffic loads is still quite challenging, which makes these works suffer
in practical configurations. On the other hand, [12] and [13] presented dynamic BS switching
algorithms with the traffic loads known a priori and preliminarily proved the effectiveness of
energy saving.

Besides, it is also found that turning some of the BSs on or off immediately affects the BS
with which a mobile terminal (MT) should be associated. Moreover, the users' subsequent
association choices in turn lead to traffic load differences among BSs. Hence, any two
consecutive BS switching operations are correlated with each other, and the current BS switching
operation will also influence the overall energy consumption in the long run. In other words,
the expected energy saving scheme must be foresighted while minimizing the energy consumption:
it should account for its effect on both the current and future system performance to deliver a
visionary BS switching operation solution.

[6] presented a partially foresighted energy saving scheme which combines BS switching
operation and user association by giving a heuristic solution on the basis of a stationary
traffic load profile. In this paper, we try to solve this problem from a different perspective.
Instead of predicting the volume of traffic loads, we apply a Markov decision process (MDP) to
model the traffic load variation. Afterwards, the solution to the formulated MDP model, i.e.,
the BS switching operation (and the corresponding user association) strategy, can be attained by
making use of the actor-critic algorithm [14], [15], a reinforcement learning (RL) approach
[16], one advantage of which is that it requires no a priori knowledge about the traffic loads
within the BSs. Within the reinforcement learning framework, a BS switching operation
controller^1, as illustrated in Fig. 1, first estimates the traffic load variation based on
on-line experience. Subsequently, the controller selects one of the possible BS switching
operations under the estimated circumstance and then decreases or increases the probability of
the same action being selected later, based on the incurred cost. Here, the cost refers to the
energy consumption due to such a BS switching operation. After repeating the actions and
observing the corresponding costs, the controller learns how to choose the active BSs under a
specific traffic load circumstance. Moreover, with the MDP model the resulting BS switching
strategy is foresighted, which improves energy efficiency in the long run.

1 In practice, such a centralized BS switching operation can be conducted by the base station
controller (BSC) in second generation (2G) cellular networks or the radio network controller
(RNC) in third generation (3G) or long term evolution (LTE) cellular networks. In this paper, we
generalize it as a BS switching operation controller.

However, a question may arise, as RL approaches usually suffer from the curse of
dimensionality and master tasks with a large set of states and actions slowly [17], [18]. Hence,
a direct application of RL approaches may sometimes run into trouble, because a BS switching
operation controller usually takes charge of tens or even hundreds of BSs [11]. In this paper,
we deal with this application problem by utilizing the conceptual idea of transfer learning (TL)
[19]-[22]. TL, which mostly concerns how to recognize and apply the knowledge learned from
one or more previous tasks (source tasks) to more effectively learn to solve a novel task (target
task) BUI , is intuitively appealing, cognitive inspired, and has led to a burst of research activities. 
Meanwhile, the spatial and temporal relevancy in the traffic loads [23] makes it meaningful to
transfer the BS switching operation strategy learned in neighboring regions or at historical
moments (source task) to help speed up the learning process in regions of interest (target
task), as depicted in Fig. 1. As a result, the learning framework of BS switching operation is
further enhanced by incorporating the idea of TL into the classical actor-critic (AC) algorithm,
and we present a Transfer Actor-CriTic algorithm (TACT) in this paper.

In a nutshell, our work proposes a reinforcement learning framework for energy saving in
RANs. Beyond that, compared to the previous work, this paper provides the following three
key insights:

• Firstly, we show that the learning framework scheme can save energy in RANs without a priori
knowledge of the traffic loads. Moreover, the performance of the learning framework scheme
approaches that of the state-of-the-art (SOTA) scheme, which is assumed to have full knowledge
of the traffic loads. These preliminary results have already been presented in [24].

• Secondly, we extend the idea of TL to the conventional RL algorithms and show that the
proposed transfer actor-critic algorithm (TACT) outperforms the classical AC algorithm with
a performance jumpstart.

• Thirdly, this paper details the convergence analysis of the TACT algorithm and thereby
contributes to the general literature in the RL field, especially on the general AC algorithm.

The remainder of the paper is organized as follows. In Section II, we introduce the system
model and formulate the traffic variation as an MDP. In Section III, we discuss the energy
saving scheme under the conventional RL framework. Section IV focuses on incorporating the idea
of TL into the conventional RL framework and investigates the convergence proof of the transfer
actor-critic algorithm. Section V evaluates the proposed schemes and presents their validity and
effectiveness. Finally, we conclude the paper in Section VI.

II. System model and problem formulation

A. System model 

A RAN usually consists of multiple BSs, while the traffic loads of BSs usually fluctuate,
thus often leaving BSs under-utilized. In this paper, we assume that there exists a region
$\mathcal{L} \subset \mathbb{R}^2$ served by a set of overlapped BSs $\mathcal{B} = \{1, \ldots, N\}$,
as Fig. 1 depicts, i.e., Region 1 or Region 2. In addition, there exists a BS switching
operation controller, which can timely know the traffic loads in these BSs at the current stage
and correspondingly determine the energy-efficient working status of every BS (i.e., active or
sleeping mode) at the next stage in a centralized way. Beyond that, the paper focuses on
downlink communication, i.e., from BSs to MTs. Meanwhile, the file transmission requests at a
location $x \in \mathcal{L}$ arrive following a Poisson point process with arrival rate per unit
area $\lambda(x)$ and file size $1/\mu(x)$. The traffic load density at a location
$x \in \mathcal{L}$ is then defined as $\gamma(x) = \lambda(x)/\mu(x) < \infty$ [6]. Therefore,
the traffic load density can capture the spatial traffic variations. For example, a hot spot can
be characterized by a high arrival rate and/or possibly large file sizes. Hence, when the set of
BSs $\mathcal{B}_{on}$ is turned on, the traffic load served by BS $i \in \mathcal{B}_{on}$ can be
represented as $\Gamma_i = \int_{\mathcal{L}} \gamma(x) I_i(x, \mathcal{B}_{on})\,dx$, where
$I_i(x, \mathcal{B}_{on})$ is a user association indicator that equals 1 if location $x$ is
served by BS $i \in \mathcal{B}_{on}$ and 0 otherwise. If a BS $i$ is in sleeping mode, i.e.,
$i \in \mathcal{B} \setminus \mathcal{B}_{on}$, the traffic load is defined as zero, namely
$\Gamma_i = 0$. To describe the traffic load variation, i.e.,
$p(\Gamma_i^{k+1} \mid \Gamma_i^{k})$, we use a finite state Markov chain (FSMC). Moreover, the
traffic load $\Gamma_i$ for BS $i$ is partitioned into two parts by a boundary point $\Gamma_b$.
Here, $\Gamma_b$ can be the average traffic load in one BS over a certain period, and thus can be
known in advance from historical records. Therefore, the traffic load of a specific BS has
merely two states, i.e., $s_i = 0$ if $\Gamma_i < \Gamma_b$ and $s_i = 1$ if
$\Gamma_i \ge \Gamma_b$. Subsequently, a state vector
$s = \{s_1, \ldots, s_N\} \in \mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ is
constructed to model the traffic load variation for the region of interest.
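The two-state discretization described above can be illustrated with a minimal Python sketch. The function name `traffic_state`, the sample loads, and the choice to assign a load exactly at the boundary to state 1 are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

def traffic_state(loads: Sequence[float], boundary: float) -> Tuple[int, ...]:
    """Map per-BS traffic loads to the binary state vector (s_1, ..., s_N)."""
    # s_i = 0 when the load of BS i is below the boundary point, 1 otherwise.
    return tuple(0 if load < boundary else 1 for load in loads)

loads = [0.2, 0.9, 0.5]  # hypothetical traffic loads of three BSs
print(traffic_state(loads, boundary=0.5))  # (0, 1, 1)
```

With N BSs the state space contains 2^N such vectors, which is why the number of states grows quickly with the size of the region.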

Let us denote the transmission rate of a user located at $x$ and served by BS
$i \in \mathcal{B}_{on}$ as $c_i(x, \mathcal{B}_{on})$. For analytical convenience, assume that
$c_i(x, \mathcal{B}_{on})$ does not change over time, i.e., we do not consider fast fading or
dynamic inter-cell interference. Instead, $c_i(x, \mathcal{B}_{on})$ is assumed to be a
time-averaged transmission rate in this paper, based on the fact that the time scale of user
association is commonly much larger than that of fast fading or dynamic inter-cell interference.
Hence, the inter-cell interference is considered as static Gaussian-like noise, which is
feasible under interference randomization or fractional frequency reuse and is consistent with
the model in [6], [25]. Beyond that, though $c_i(x, \mathcal{B}_{on})$ is location-dependent, it
is not necessarily determined by the distance from BS $i$, due to the shadowing effect.

Furthermore, the system load density is defined as the fraction of time required to deliver
the traffic load $\gamma(x)$ from BS $i \in \mathcal{B}_{on}$ to location $x$, namely
$\varrho_i(x) = \gamma(x)/c_i(x, \mathcal{B}_{on})$. Analogous to the definition of traffic load,
the system load for an active BS $i \in \mathcal{B}_{on}$ can be represented as
$\rho_i = \int_{\mathcal{L}} \varrho_i(x) I_i(x, \mathcal{B}_{on})\,dx$. Meanwhile, the system load
for a sleeping BS $i$ is defined as zero, namely $\rho_i = 0$ if
$i \in \mathcal{B} \setminus \mathcal{B}_{on}$. Hence, the indicator set
$\mathcal{I} = \{I_i(x, \mathcal{B}_{on}) \mid i \in \mathcal{B}, x \in \mathcal{L}\}$ is feasible
[6] if every BS can serve its load, i.e., $\rho_i < 1, \forall i \in \mathcal{B}$. Eventually,
our goal is to choose certain active BSs and find a feasible user association indicator set to
minimize the overall energy consumption. By exploiting the proposed learning framework, the
controller can eventually learn the BS switching operation strategy without prior knowledge of
the traffic loads. We will give the details in Section III.
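As a rough illustration of the system load and the feasibility condition, the sketch below approximates the integral over the region by a sum over grid cells. The grid, the names `system_loads` and `is_feasible`, and all numeric values are hypothetical; only the structure (load density divided by rate, summed over the cells a BS serves, and the requirement that every load stays below 1) comes from the text.

```python
from typing import List, Sequence

def system_loads(gamma: Sequence[float],
                 rate: Sequence[Sequence[float]],
                 assoc: Sequence[int],
                 n_bs: int,
                 cell_area: float = 1.0) -> List[float]:
    """Approximate each BS's system load by summing gamma(x) / c_i(x) over its cells."""
    rho = [0.0] * n_bs
    for x, (g, i) in enumerate(zip(gamma, assoc)):
        # rate[i][x] plays the role of c_i(x, B_on). Sleeping BSs never appear
        # in `assoc`, so their system load stays at zero.
        rho[i] += g / rate[i][x] * cell_area
    return rho

def is_feasible(rho: Sequence[float]) -> bool:
    """An association is feasible when every BS has a system load below 1."""
    return all(r < 1.0 for r in rho)

gamma = [2.0, 4.0, 6.0]                           # hypothetical load density per cell
rate = [[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]]   # hypothetical rates for two BSs
rho = system_loads(gamma, rate, assoc=[0, 0, 1], n_bs=2)
print(rho, is_feasible(rho))
```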



B. Problem formulation 

In this paper, we primarily aim to minimize the whole-scale energy consumption of BSs 
in RANs. Our previous work 0T] has shown the energy consumption of BS is not linearly 
proportional to the traffic load within its coverage area. Moreover, the energy consumption of 
BSs consists of two categories: constant one and variant one that is proportional to BS's traffic 
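load. Hence, we adopt the generalized energy consumption model [9], which can be summarized
as

$$\psi(\boldsymbol{\rho}, \mathcal{B}_{on}) = \sum_{i \in \mathcal{B}_{on}} \left[ (1 - q_i)\rho_i P_i + q_i P_i \right], \qquad (1)$$

where $\boldsymbol{\rho} = \{\rho_1, \ldots, \rho_N\}$. Besides, $q_i \in (0, 1)$ is the portion of
constant power consumption for BS $i$, and $P_i$ is the maximum power consumption of BS $i$ when
it is fully utilized.

Above all, our problem is to find an optimal set of active BSs and the corresponding user
association that minimize the energy consumption, namely

$$\min_{\mathcal{B}_{on}, \boldsymbol{\rho}} \left\{ \psi(\boldsymbol{\rho}, \mathcal{B}_{on}) \right\}, \qquad (2)$$

$$\text{s.t. } \rho_i \in [0, 1), \ \forall i \in \mathcal{B}.$$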
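The energy model, a constant part plus a part proportional to the system load, summed over the active BSs, can be sketched as follows. The function name `energy_consumption`, the constant-power portions, and the system loads are hypothetical; the 865 W and 38 W maximum powers are the macro and micro BS values quoted in the simulation section.

```python
from typing import Iterable, Sequence

def energy_consumption(rho: Sequence[float],
                       q: Sequence[float],
                       p_max: Sequence[float],
                       active: Iterable[int]) -> float:
    """Total power of the active BSs: load-dependent part plus constant part."""
    return sum((1.0 - q[i]) * rho[i] * p_max[i] + q[i] * p_max[i] for i in active)

p_max = [865.0, 38.0]   # macro and micro BS maximum operational powers (W)
q = [0.5, 0.5]          # hypothetical constant-power portions
both_on = energy_consumption([0.4, 0.2], q, p_max, active=[0, 1])
macro_only = energy_consumption([0.42, 0.0], q, p_max, active=[0])
print(both_on, macro_only)
```

Sleeping BSs are simply left out of `active`, so they contribute neither the constant nor the load-dependent term.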
III. Stochastic BS switching operation in reinforcement learning framework

A. Markov decision process

An MDP is defined as a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, C \rangle$, where
$\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ is a state transition
probability function, and $C$ is a cost function^2. Specifically, at stage $k$, the traffic load
state is $s^k$. The controller chooses to turn some BSs into sleeping mode (action $a^k$) and the
users correspondingly associate themselves with the remaining active BSs according to an
indicator set $\mathcal{I}^k$ ^3. Thereafter, the traffic load state transitions to $s^{k+1}$
with probability $p(s^{k+1} \mid s^k, a^k)$. Meanwhile, the immediate cost generated by the
environment (computed by Equation (1)) is fed back to the agent, i.e., the BS switching
operation controller.

The goal is to find a strategy $\pi$, which maps a state $s$ to an action $\pi(s)$, i.e.,
$a^k$, to minimize the discounted accumulative cost starting from the state $s$. Formally, this
accumulative cost is called the state value function, which can be calculated by [16]

$$V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k C(s^k, \pi(s^k)) \,\Big|\, s^0 = s \right] = E_{\pi}\left[ C(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s)) V^{\pi}(s') \right], \qquad (4)$$

where $\gamma$ is the discount factor that maps the future cost to the current state. Since
future costs matter less than the current one, $\gamma$ is smaller than 1. The optimal strategy
$\pi^*$ satisfies the Bellman equation [16]:

$$V^*(s) = V^{\pi^*}(s) = \min_{a \in \mathcal{A}} \left\{ E_{\pi^*}\left[ C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) V^{\pi^*}(s') \right] \right\}. \qquad (5)$$

Since the optimal strategy minimizes not only the current cost but also the cumulative cost from
the beginning, it contributes to designing a foresighted energy saving scheme.

2 It may instead be a reward function $R$, depending on the specific research scenario.
Moreover, it is worthwhile to note that we use the lowercase $c_i(x, \mathcal{B}_{on})$ to denote
the transmission rate from BS $i$ to location $x$, while the uppercase $C$ denotes the cost
function.

3 In this paper, we adopt and modify the approach for user association in [6]. At stage $k$,
the user association set $\mathcal{I}^k$, which achieves the minimization of the total cost,
would be that users at location $x$ choose to join BS $i^*$, where $i^*$ satisfies

$$i^*(x) = \arg\max_{j \in \mathcal{B}_{on}} \frac{c_j(x, \mathcal{B}_{on})}{(1 - q_j) P_j}, \quad \forall x \in \mathcal{L}. \qquad (3)$$

Intuitively, Equation (3) means that users at location $x$ prefer to join the BS with the
largest transmission rate for the same traffic load-variant power consumption.

It is worthwhile to note that this user association scheme may degrade the quality of
experience (QoE), e.g., by increasing the delay. We leave how to strike a balance between user
QoE and energy consumption as future work.

B. The actor-critic learning framework for energy saving scheme 

There exist well-known methods for solving MDPs, such as dynamic programming [16].
Unfortunately, these methods depend heavily on prior knowledge of the environmental dynamics.
However, it is challenging to know the future traffic loads precisely in advance. Therefore, in
this paper, we employ an actor-critic algorithm, one kind of reinforcement learning approach, to
solve the MDP problem. The reasons for adopting the actor-critic algorithm are twofold [27]:
(i) since it generates the action directly from the stored policy, it requires little
computation to select an action to perform; (ii) it can learn an explicitly stochastic policy,
which may be useful in the non-Markov traffic variation environment of RANs.

As the name suggests, the actor-critic algorithm encompasses three components: actor, critic,
and environment, as illustrated in Fig. 2 (left). At a given state, the actor selects an action
in a stochastic way and then executes it. This execution transitions the environment to a new
state with a certain probability and feeds back the cost to the actor. Then, the critic
criticizes the action executed by the actor through a temporal difference (TD) error. After the
criticism, the actor updates its policy so that actions yielding a smaller cost are selected
with a higher tendency, and vice versa. The algorithm repeats the above procedure until
convergence.

We design an actor-critic learning framework for energy saving scheme as illustrated in Fig. 

® 

1) Action selection: Let us assume that when the controller needs to select an action, the
system is at the beginning of stage $k$ and the traffic load state is $s^k$. The controller then
selects an action according to a stochastic strategy, the purpose of which is to improve
performance while explicitly balancing two competing objectives: a) searching for a better BS
switching operation (exploration) and b) incurring as little cost as possible (exploitation),
such that the controller not only performs a good BS switching operation based on its past
experience but is also able to explore a new one. The most common methodology is to use a
Boltzmann distribution. The controller chooses an action $a$ in state $s^k$ of stage $k$ with
probability [16]

$$\pi^k(s^k, a) = \frac{\exp\{p(s^k, a)/\tau\}}{\sum_{a' \in \mathcal{A}} \exp\{p(s^k, a')/\tau\}}, \qquad (6)$$

where $\tau$ is a positive parameter called the temperature. In addition, $p(s^k, a^k)$
indicates the tendency to select action $a^k$ at state $s^k$, and it is updated after every
iteration. It is worthwhile to note that the remaining active BSs may not be enough to serve the
traffic loads in the present stage $k$. In this case, the controller can start an emergent
response paradigm to quickly turn on some BSs, as conventional energy saving schemes commonly
do, which is out of the scope of this paper. Hence, in this paper, we assume that the action
$a^k$ which the controller finally chooses can meet the traffic load requirement.
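The Boltzmann action selection of Equation (6) can be sketched in a few lines of Python. The function names `boltzmann` and `sample_action`, the action labels, and the preference values are hypothetical. Subtracting the maximum preference before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged; it is not something the paper specifies.

```python
import math
import random
from typing import Dict, Hashable

def boltzmann(prefs: Dict[Hashable, float], tau: float) -> Dict[Hashable, float]:
    """Turn action tendencies p(s, a) for one state into selection probabilities."""
    m = max(prefs.values())  # shift for numerical stability only
    weights = {a: math.exp((p - m) / tau) for a, p in prefs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def sample_action(probs: Dict[Hashable, float], rng: random.Random) -> Hashable:
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

prefs = {"all_on": 0.0, "sleep_micro": 1.0}   # hypothetical tendencies in one state
probs = boltzmann(prefs, tau=1.0)
print(probs, sample_action(probs, random.Random(0)))
```

A large temperature flattens the distribution and favors exploration, while a small one concentrates the probability on the action with the largest tendency.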

2) User association and data transmission: After the controller chooses to turn some of the
BSs into sleeping mode, the users at location $x$ choose to connect to one BS according to
Equation (3) and start the data communication.
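The association rule of Equation (3), under which a user joins the active BS with the largest ratio of transmission rate to load-variant power, can be sketched as below. The function name `associate`, the rates, and the constant-power portions are hypothetical; the maximum powers are the macro and micro values from the simulation section.

```python
from typing import Dict, Iterable

def associate(rate_at_x: Dict[int, float],
              q: Dict[int, float],
              p_max: Dict[int, float],
              active: Iterable[int]) -> int:
    """Pick the active BS maximizing c_j(x) / ((1 - q_j) * P_j) for one location x."""
    # Sorting makes the choice deterministic if two BSs tie.
    return max(sorted(active), key=lambda j: rate_at_x[j] / ((1.0 - q[j]) * p_max[j]))

rate_at_x = {0: 10e6, 1: 2e6}     # hypothetical rates (bit/s) from BS 0 and BS 1
q = {0: 0.5, 1: 0.5}              # hypothetical constant-power portions
p_max = {0: 865.0, 1: 38.0}       # macro and micro maximum operational powers (W)
print(associate(rate_at_x, q, p_max, active={0, 1}))  # 1
print(associate(rate_at_x, q, p_max, active={0}))     # 0
```

In this example the micro BS wins despite its lower rate because its load-variant power is far smaller, which matches the intuition given for Equation (3).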

3) State-value function update: After the transmission part of stage $k$, the traffic loads in
each BS will change, thus transitioning the system to state $s^{k+1}$. Meanwhile, the total cost
for the transmission is $C^k(s^k, a^k)$. Consequently, a TD error $\delta^k(s^k, a^k)$ is
computed at the critic as the difference between the state-value function $V^k(s^k)$ estimated
at the preceding state and the updated estimate $C^k(s^k, a^k) + \gamma \cdot V^k(s^{k+1})$,
namely

$$\delta^k(s^k, a^k) = C^k(s^k, a^k) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s^k, a^k) V^k(s') - V^k(s^k) = C^k(s^k, a^k) + \gamma \cdot V^k(s^{k+1}) - V^k(s^k). \qquad (7)$$

Afterwards, the TD error is fed back to the actor. Meanwhile, the state-value function is
updated as

$$V^{k+1}(s^k) = V^k(s^k) + \alpha(\nu_1(s^k, k)) \cdot \delta^k(s^k, a^k). \qquad (8)$$

Here, $\nu_1(s^k, k)$ denotes the number of occurrences of state $s^k$ in these $k$ stages, and
$\alpha(\cdot)$ is a positive step-size parameter that affects the convergence rate. On the other
hand, if $s \ne s^k$, $V^{k+1}(s)$ is kept the same as $V^k(s)$, namely
$V^{k+1}(s) = V^k(s), \forall s \in \mathcal{S}$ with $s \ne s^k$.
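The critic step of Equations (7) and (8) can be sketched as follows. The function name `critic_update`, the state labels, the costs, and the discount factor used in the example are hypothetical; the step size 1/n is the one the simulation section later adopts for the critic.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Hashable

def critic_update(V: DefaultDict[Hashable, float],
                  visits: DefaultDict[Hashable, int],
                  s: Hashable, cost: float, s_next: Hashable, gamma: float,
                  alpha: Callable[[int], float] = lambda n: 1.0 / n) -> float:
    """Compute the TD error for (s, cost, s_next) and update V(s) in place."""
    visits[s] += 1                                # occurrence count of state s
    delta = cost + gamma * V[s_next] - V[s]       # sampled TD error, Equation (7)
    V[s] += alpha(visits[s]) * delta              # Equation (8); other states untouched
    return delta

V = defaultdict(float)
visits = defaultdict(int)
d1 = critic_update(V, visits, "low", cost=5.0, s_next="high", gamma=0.9)
d2 = critic_update(V, visits, "low", cost=3.0, s_next="low", gamma=0.9)
print(d1, d2, V["low"])  # 5.0 2.5 6.25
```

The returned TD error is what the actor consumes in the subsequent policy update.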

4) Policy update: At the end of stage $k$, the critic employs the TD error to "criticize"
the selected action, which is implemented as

$$p^{k+1}(s^k, a^k) = p^k(s^k, a^k) - \beta(\nu_2(s^k, a^k, k)) \cdot \delta^k(s^k, a^k). \qquad (9)$$

Similar to $\nu_1(s^k, k)$, $\nu_2(s^k, a^k, k)$ indicates the number of times action $a^k$ has
been executed at state $s^k$ in these $k$ stages, and $\beta(\cdot)$ is a positive step-size
parameter. Equation (6) and Equation (9) ensure that an action under a specific state is
selected with higher probability if the "foresighted" cost it incurs is comparatively smaller,
i.e., $\delta^k(s^k, a^k) < 0$. Additionally, if $a \ne a^k$, $p^{k+1}(s^k, a)$ remains
unchanged, namely $p^{k+1}(s^k, a) = p^k(s^k, a), \forall a \in \mathcal{A}$ with $a \ne a^k$.
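The actor step of Equation (9) can be sketched as below. The names `beta` and `actor_update`, the state and action labels, and the TD error value are hypothetical. The step size is an assumed, shifted variant of 1/(n log n) so that the first visit (n = 1, where log n = 0) is well defined; the paper does not state how that case is handled.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

def beta(n: int) -> float:
    """Assumed step size: 1/((n+1) log(n+1)), a shifted form of 1/(n log n)."""
    return 1.0 / ((n + 1) * math.log(n + 1))

def actor_update(p: DefaultDict[Tuple[Hashable, Hashable], float],
                 counts: DefaultDict[Tuple[Hashable, Hashable], int],
                 s: Hashable, a: Hashable, delta: float) -> None:
    """Lower the tendency of (s, a) after a positive TD error, raise it after a negative one."""
    counts[(s, a)] += 1                            # execution count of a in s
    p[(s, a)] -= beta(counts[(s, a)]) * delta      # only the executed pair changes

p = defaultdict(float)
counts = defaultdict(int)
actor_update(p, counts, "low", "sleep_micro", delta=-2.0)
print(p[("low", "sleep_micro")], p[("low", "all_on")])
```

A negative TD error means the action was cheaper than expected, so its tendency, and hence its Boltzmann probability, increases.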

If each action is executed infinitely often in every state, in other words, if the learning
strategy is greedy in the limit with infinite exploration, the value function $V(s)$ and
strategy $\pi^k(s, a)$ will finally converge to $V^*$ and $\pi^*$ with probability (w.p.) 1 as
$k \to \infty$ [28].

IV. Transfer actor-critic algorithm for stochastic BS switching operation 
A. Motivation and formulation of transfer actor-critic algorithm 

The previous section addresses how to exploit the classical AC algorithm to conduct the BS
switching operation, culminating in an effective energy saving strategy. In this section, we
present how the controller utilizes the knowledge of a strategy learned in a neighboring region
or a historical period to find the optimal BS switching operation more quickly.

Basically, the policy, say $p(s, a)$, which finally determines the strategy $\pi(s, a)$ in one
learning task, indicates the tendency of action $a$ to be chosen in state $s$. When the learning
process converges, the tendency to choose a specific action $a$ is comparatively larger than
that of other actions. In other words, if the controller decides the BS switching operation
according to action $a$, the energy consumption of the whole system tends to be optimized in the
long run. Hence, if the knowledge of this policy $p(s, a)$ is transferred to another task, e.g.,
from Region 1 (source task) to Region 2 (target task) in Fig. 1, the controller in the target
task can attempt the same action $a$ when the traffic loads enter state $s$. Compared to
learning from scratch, the controller might directly make the wisest choice at the very
beginning. However, in spite of the similarities between the source task and the target task,
differences still exist. For example, the system might enter the same state in two different
tasks, whereas the traffic loads in the source task (i.e., Region 1) might usually be higher
than those in the target one (i.e., Region 2). Hence, instead of staying with the chosen action
$a$, the controller could make a more aggressive choice and turn more BSs into sleeping mode,
thus saving more energy. Consequently, in this case, the transferred policy guides the
controller in a negative manner. To avoid this underlying problem, the transferred tendency
should have a decreasing impact on choosing a certain action once the controller has attempted
this action and nurtured its own learning experience.

Accordingly, we propose a new policy update method for the Transfer Actor-CriTic algorithm
(TACT), as shown in Fig. 2. In the TACT algorithm, the overall policy to select an action,
$p_o$, is divided into a native one $p_n$ and an exotic one $p_e$. Without loss of generality,
let us assume that at stage $k$, the traffic load state is $s^k$ and the chosen action is $a^k$.
Accordingly, the overall policy $p_o$ is updated as

$$p_o^{k+1}(s^k, a^k) = \left[ (1 - \zeta(\nu_2(s^k, a^k, k)))\, p_n^{k+1}(s^k, a^k) + \zeta(\nu_2(s^k, a^k, k))\, p_e(s^k, a^k) \right]_{-p_t}^{p_t}, \qquad (10)$$

where $[x]_a^b$, with $b > a$, denotes the Euclidean projection of $x$ onto the interval
$[a, b]$, i.e., $[x]_a^b = a$ if $x < a$, $[x]_a^b = b$ if $x > b$, and $[x]_a^b = x$ if
$a \le x \le b$. In this case, $a = -p_t$ and $b = p_t$, with $p_t > 0$. Additionally, if
$a \ne a^k$, $p_o^{k+1}(s^k, a)$ remains unchanged, namely
$p_o^{k+1}(s^k, a) = p_o^k(s^k, a), \forall a \in \mathcal{A}$ with $a \ne a^k$. Besides that,
$p_n(s, a)$ still updates itself according to the classical actor-critic algorithm, namely
Equation (9).
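The blended, projected update of Equation (10) can be sketched as follows. The names `project` and `tact_policy` and the numeric values are hypothetical; the transfer rate is taken as a power of a factor in (0, 1), which is the form the simulation section adopts.

```python
def project(x: float, bound: float) -> float:
    """Euclidean projection of x onto the interval [-bound, bound]."""
    return max(-bound, min(bound, x))

def tact_policy(p_native: float, p_exotic: float, n: int,
                theta: float, p_t: float) -> float:
    """Mix native and exotic tendencies with transfer rate theta**n, then project."""
    zeta = theta ** n   # n is the number of times the (state, action) pair was executed
    return project((1.0 - zeta) * p_native + zeta * p_exotic, p_t)

# Early on, the transferred (exotic) tendency carries real weight ...
print(tact_policy(p_native=0.0, p_exotic=4.0, n=1, theta=0.5, p_t=10.0))   # 2.0
# ... and after ten executions of the same pair it has almost vanished.
print(tact_policy(p_native=0.0, p_exotic=4.0, n=10, theta=0.5, p_t=10.0))  # 0.00390625
```

Because the weight on the exotic tendency shrinks with every execution of the pair, a misleading transferred tendency loses influence as native experience accumulates.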

Initially, the exotic policy $p_e(s, a)$ dominates the overall policy. Hence, when the
environment enters a state $s$, the presence of $p_e(s, a)$ favors the action that might be
optimal for $s$ in the source task. Consequently, the proposed policy update method leads to a
possible performance jumpstart. Beyond that, $\zeta(n) \in (0, 1)$ is the transfer rate, with
$\zeta(n) \to 0$ as $n \to \infty$. The decay of $\zeta(n)$ continuously decreases the effect of
the transferred exotic policy $p_e(s, a)$. Therefore, the controller can not only take advantage
of the learned expertise in the source task, but also swiftly get rid of negative guidance.

Finally, we summarize our proposed TACT algorithm in Algorithm 1.
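To make the flow of the algorithm concrete, the following is a toy sketch of the learning loop on a two-state, two-action problem. Everything about the environment (`toy_step`, its costs, the uniformly random state transitions), the parameter values, the shifted actor step size, and the initialization of the overall policy from the exotic one are assumptions for illustration; the sketch omits user association and the feasibility check, and it makes no claim about reproducing the paper's results.

```python
import math
import random
from collections import defaultdict

def boltzmann(prefs, tau):
    m = max(prefs)                                   # shift for numerical stability
    w = [math.exp((p - m) / tau) for p in prefs]
    z = sum(w)
    return [x / z for x in w]

def toy_step(s, a, rng):
    """Hypothetical environment: action 1 (sleep a BS) is cheap only under low load."""
    cost = {("low", 0): 2.0, ("low", 1): 1.0,
            ("high", 0): 2.5, ("high", 1): 4.0}[(s, a)]
    return cost, rng.choice(["low", "high"])

def run_tact(step, states, n_actions, p_exotic, episodes=2000,
             gamma=0.9, tau=1.0, theta=0.9, p_t=50.0, seed=0):
    rng = random.Random(seed)
    clip = lambda x: max(-p_t, min(p_t, x))
    V = defaultdict(float)                           # state-value function
    p_n = defaultdict(float)                         # native policy
    # Assumption: the overall policy starts from the (projected) exotic policy.
    p_o = {(s, a): clip(p_exotic.get((s, a), 0.0))
           for s in states for a in range(n_actions)}
    n_s, n_sa = defaultdict(int), defaultdict(int)
    s = states[0]
    for _ in range(episodes):
        probs = boltzmann([p_o[(s, a)] for a in range(n_actions)], tau)
        a = rng.choices(range(n_actions), weights=probs, k=1)[0]
        cost, s_next = step(s, a, rng)
        delta = cost + gamma * V[s_next] - V[s]      # TD error
        n_s[s] += 1
        n_sa[(s, a)] += 1
        V[s] += delta / n_s[s]                       # critic step size 1/n
        n = n_sa[(s, a)]
        p_n[(s, a)] -= delta / ((n + 1) * math.log(n + 1))   # shifted actor step size
        zeta = theta ** n                            # decaying transfer rate
        p_o[(s, a)] = clip((1 - zeta) * p_n[(s, a)]
                           + zeta * p_exotic.get((s, a), 0.0))
        s = s_next
    return V, p_o

# Hypothetical transferred knowledge: the source task favored sleeping under low load.
V, p_o = run_tact(toy_step, ["low", "high"], 2, {("low", 1): 5.0})
print({k: round(v, 3) for k, v in p_o.items()})
```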

Algorithm 1 TACT: The Transfer Learning Framework for Energy Saving Scheme

Initialization:
  for each $s \in \mathcal{S}$, each $a \in \mathcal{A}$ do
    Initialize the state-value function $V(s)$, the native policy function $p_n(s, a)$, and the
    strategy function $\pi(s, a)$;
  end for

Repeat until convergent
  1) Choose an action $a^k$ in state $s^k$ according to $\pi(s^k, a^k)$;
  2) Users at location $x$ connect to one BS $i^*$ by
     $i^*(x) = \arg\max_{j \in \mathcal{B}_{on}} \frac{c_j(x, \mathcal{B}_{on})}{(1 - q_j) P_j}, \forall x \in \mathcal{L}$,
     and then start data transmission;
  3) If $\rho_i < 1, \forall i \in \mathcal{B}$, the chosen action is feasible. The cost function
     $C(s^k, a^k)$ is calculated by $\sum_{i \in \mathcal{B}_{on}} [(1 - q_i)\rho_i P_i + q_i P_i]$;
     otherwise, an emergent response paradigm starts as the conventional scheme does.
  4) Identify the traffic loads, accordingly update the state $s^k \to s^{k+1}$, and compute the
     TD error by $\delta^k(s^k, a^k) = C^k(s^k, a^k) + \gamma \cdot V^k(s^{k+1}) - V^k(s^k)$;
  5) Update the state-value function $V(s^k)$ by
     $V^{k+1}(s^k) = V^k(s^k) + \alpha(\nu_1(s^k, k)) \cdot \delta^k(s^k, a^k)$;
  6) Update the native tendency function $p_n(s^k, a^k)$ by
     $p_n^{k+1}(s^k, a^k) = p_n^k(s^k, a^k) - \beta(\nu_2(s^k, a^k, k)) \cdot \delta^k(s^k, a^k)$,
     and update the overall policy function $p_o(s^k, a^k)$ by
     $p_o^{k+1}(s^k, a^k) = [(1 - \zeta(\nu_2(s^k, a^k, k)))\, p_n^{k+1}(s^k, a^k) + \zeta(\nu_2(s^k, a^k, k))\, p_e(s^k, a^k)]_{-p_t}^{p_t}$;
  7) Update the strategy function
     $\pi^{k+1}(s^k, a) = \frac{\exp\{p_o^{k+1}(s^k, a)/\tau\}}{\sum_{a' \in \mathcal{A}} \exp\{p_o^{k+1}(s^k, a')/\tau\}}$,
     for all $a \in \mathcal{A}$.

B. Convergence analysis

Next, we are interested in the convergence of the TACT algorithm. We start the analysis by
introducing several related lemmas. Singh [28] shows that the Boltzmann method is greedy in
the limit with infinite exploration, given a large enough $\tau$. Therefore, we have the
following lemma.

Lemma 1. If we use the Boltzmann exploration method with a large enough $\tau$, there exists
an $\eta > 0$ such that

$$\liminf_{k \to \infty} \frac{\nu_2(s, a, k)}{k} \ge \eta, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}. \qquad (11)$$

In other words, as $k \to \infty$, $\nu_2(s, a, k) \ge \eta k \to \infty$.



Definition 1. Define a function $\vartheta_{s,a}(p_o)$ as

$$\vartheta_{s,a}(p_o) = \begin{cases} 0, & \text{if } p_o(s, a) = p_t \text{ and } \delta(s, a) > 0, \text{ or } p_o(s, a) = -p_t \text{ and } \delta(s, a) < 0, \\ 1, & \text{otherwise.} \end{cases} \qquad (12)$$

The next theorem states that our proposed policy update tracks an ordinary differential equation
(ODE).

Theorem 1. Assume that the learning rate $\beta(n)$ in Equation (9) satisfies

$$\sum_{n=0}^{\infty} \beta(n) = \infty, \quad \beta(n) > 0, \quad \sum_{n=0}^{\infty} \beta(n)^2 < \infty, \qquad (13)$$

and the transfer rate $\zeta(n)$ satisfies $\zeta(n)/\beta(n) \to 0$ as $n \to \infty$. Then
$p_o(s, a)$ asymptotically tracks the solution of the ODE

$$\dot{p}_o(s, a)(t) = -\delta(s, a)\,\vartheta_{s,a}(p_o), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}, \qquad (14)$$

where $\delta(s, a) = \lim_{k \to \infty} \delta^k(s, a)$.

Proof: The proof is given in the Appendix. ■
In addition, we introduce the definition of a strict Lyapunov function [29], which is
fundamental to our following proof.

Definition 2. Suppose that for an ODE $\dot{z}(t) = f(z)$ defined on a region $\mathcal{D}$,
$V(z)$ is a continuously differentiable and real-valued function of $z$ such that $V(0) = 0$ and
$V(z) > 0, \forall z \ne 0$. If $\dot{V}(t) = \nabla V \cdot \dot{z} = \nabla V \cdot f(z) \le 0$
on the region $\mathcal{D}$, and the equality holds only when $\dot{z}(t) = 0$, the function
$V(z)$ is a strict Lyapunov function for the ODE $\dot{z}(t)$.

Our proof relies on the following theorem by Konda and Borkar [14], which establishes the
convergence of a general actor-critic algorithm.

Theorem 2. Assume that the learning rate $\alpha(n)$ satisfies the assumptions in Section 2.2
of [14], and that $\beta(n)$ and $\zeta(n)$ meet the conditions in Theorem 1. If the strategy
$\pi$, which is derived by Equation (6) with the policy update method given by Equation (10),
has a strict Lyapunov function for the ODE $\dot{\pi}(t)$, then $\pi$ converges w.p. 1 and
$\|\pi - \pi^*\| < \epsilon$ for any $\epsilon > 0$ as $p_t \to \infty$.

Before proceeding, we obtain the following lemma by directly applying Equation (4) to
Equation (7).

Lemma 2.

$$\sum_{a \in \mathcal{A}} \delta(s, a)\pi(s, a) = 0, \quad \forall s \in \mathcal{S}. \qquad (15)$$



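Lemma 3. If the strategy $\pi(s, a)$ tracks the solution of the ODE $\dot{\pi}(s, a)(t)$, and
$\dot{\pi}(s, a)(t)$ satisfies $\dot{\pi}(s, a)(t)\delta(s, a) < 0$, then we have
$\nabla V^{\pi}(s)\dot{\pi}(s, a)(t) \le \dot{\pi}(s, a)(t)\delta(s, a) < 0, \forall s \in \mathcal{S}$.

Proof: For two distinct policies $\pi$ and $\pi'$, let us define a value function operator
$T(\pi', V^{\pi}(s)) = E_{\pi'}\left[ C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) V^{\pi}(s') \right]$.
Assume that there exists an infinitesimal $\epsilon > 0$ such that $\pi + \epsilon\dot{\pi}(t)$ is
still a valid strategy. Denoting $\pi' = \pi + \epsilon\dot{\pi}(t)$, we have

$$\begin{aligned}
T(\pi', V^{\pi}(s)) - V^{\pi}(s) &= E_{\pi'}\left[ C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) V^{\pi}(s') \right] - V^{\pi}(s) \\
&= \sum_{a \in \mathcal{A}} \left( \pi(s, a) + \epsilon\dot{\pi}(s, a) \right) \left[ C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) V^{\pi}(s') - V^{\pi}(s) \right] \\
&= \sum_{a \in \mathcal{A}} \left( \pi(s, a) + \epsilon\dot{\pi}(s, a) \right) \delta(s, a) \\
&= \sum_{a \in \mathcal{A}} \epsilon\dot{\pi}(s, a)\delta(s, a) < 0.
\end{aligned}$$

The last equality follows from Lemma 2.

Denote the iterated operator of $T(\pi', V^{\pi}(s))$ as
$T^n(\pi', V^{\pi}(s)) = T^{n-1}(\pi', T(\pi', V^{\pi}(s)))$; we have
$T^n(\pi', V^{\pi}(s)) \le T^{n-1}(\pi', V^{\pi}(s)) \le \cdots \le V^{\pi}(s)$.

In addition, $T^n(\pi', V^{\pi}(s)) - V^{\pi}(s) \le \sum_{a \in \mathcal{A}} \epsilon\dot{\pi}(s, a)\delta(s, a)$
for $n \ge 1$. As $n \to \infty$, $T^n(\pi', V^{\pi}(s)) \to V^{\pi'}(s)$, and we obtain

$$\frac{V^{\pi'}(s) - V^{\pi}(s)}{\epsilon} = \frac{V^{\pi + \epsilon\dot{\pi}}(s) - V^{\pi}(s)}{\epsilon} \le \dot{\pi}(s, a)\delta(s, a) < 0.$$

As $\epsilon \to 0$, $\nabla V^{\pi}(s)\dot{\pi}(s, a)(t) \le \dot{\pi}(s, a)\delta(s, a) < 0$. The
claim follows. ■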

Theorem 3. E V*(s) is a strict Lyapunov function for ODE h(t). 

ses 

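Proof: By explicitly differentiating Equation (6) over $p_o(s,a)(t)$, we have

$$
\begin{aligned}
\dot{\pi}(s,a)
&= \frac{1}{\tau}\,\frac{\exp[p_o(s,a)/\tau]}{\sum_{a' \in A} \exp[p_o(s,a')/\tau]}\,\dot{p}_o(s,a)
 - \frac{1}{\tau}\,\frac{\exp[p_o(s,a)/\tau]\,\sum_{a' \in A} \{\exp[p_o(s,a')/\tau]\,\dot{p}_o(s,a')\}}{\{\sum_{a' \in A} \exp[p_o(s,a')/\tau]\}^{2}} \\
&= \frac{1}{\tau}\,\pi(s,a)\,\dot{p}_o(s,a) - \frac{1}{\tau}\,\pi(s,a) \sum_{a' \in A} \pi(s,a')\,\dot{p}_o(s,a') \\
&= \frac{1}{\tau}\,\pi(s,a)\,\dot{p}_o(s,a) - \frac{1}{\tau}\,\pi(s,a) \sum_{a' \in A} \pi(s,a')\,\big[-\delta(s,a')\,\Phi_{s,a'}(p_o)\big] \\
&= \frac{1}{\tau}\,\pi(s,a)\,\dot{p}_o(s,a).
\end{aligned}
$$

The last equality follows from Lemma 2. By Theorem 1,

$$
\dot{\pi}(s,a)(t)\,\delta(s,a) = -\frac{1}{\tau}\,\pi(s,a)\,\delta(s,a)\,\Phi_{s,a}(p_o)\,\delta(s,a) = -\frac{1}{\tau}\,\pi(s,a)\,[\delta(s,a)]^{2}\,\Phi_{s,a}(p_o) \le 0.
$$

The equality only holds at the equilibrium point $\dot{p}_o(s,a) = 0$. By Lemma 3, $\nabla V^{\pi}(s)\,\dot{\pi}(s,a)(t) \le 0$.
Therefore, according to Definition 2, the claim follows. ■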
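
The second line of the derivation in the proof of Theorem 3 can be checked numerically. The sketch below assumes that Equation (6) is the Boltzmann (softmax) policy with temperature $\tau$, as the exponentials in the derivation suggest; the function names and the test values are illustrative and not part of the original scheme. It compares the closed form $\dot{\pi}(s,a) = \frac{1}{\tau}\pi(s,a)\big(\dot{p}_o(s,a) - \sum_{a'} \pi(s,a')\dot{p}_o(s,a')\big)$ with a central finite difference.

```python
import math


def boltzmann_policy(prefs, tau):
    """Softmax over the action preferences p_o(s, .) with temperature tau."""
    shift = max(prefs)  # subtract the maximum for numerical stability
    weights = [math.exp((p - shift) / tau) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def policy_time_derivative(prefs, pref_dots, tau):
    """Closed form: (1/tau) * pi(a) * (pdot(a) - sum_a' pi(a') * pdot(a'))."""
    pi = boltzmann_policy(prefs, tau)
    mean_dot = sum(p * d for p, d in zip(pi, pref_dots))
    return [p * (d - mean_dot) / tau for p, d in zip(pi, pref_dots)]


def finite_difference_derivative(prefs, pref_dots, tau, h=1e-6):
    """Central difference of the policy along the direction pref_dots."""
    plus = boltzmann_policy([p + h * d for p, d in zip(prefs, pref_dots)], tau)
    minus = boltzmann_policy([p - h * d for p, d in zip(prefs, pref_dots)], tau)
    return [(a - b) / (2.0 * h) for a, b in zip(plus, minus)]
```

Because the policy always sums to one, the components of the derivative sum to zero, which gives a second quick sanity check on the closed form.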

Theorem 4. Regardless of the initial value chosen for $p_o(s,a)$ and of the transferred knowledge
$p_e(s,a)$, if the learning rates $\alpha(n)$, $\beta(n)$ and the transfer rate $\zeta(n)$ meet the required conditions,
and $p_t$ and $\tau$ are sufficiently large, then Algorithm 1 converges.

Proof: The proof is a direct application of Theorem 2, which establishes convergence
given two conditions. First, the policy $p_o(s,a)$ tracks the solution of an ODE, by Theorem 1.
Second, the tracked ODE has a strict Lyapunov function, by Theorem 3. Therefore, the learning
process in Algorithm 1 converges. ■

V. Numerical analysis 

We validate the energy efficiency improvement of our proposed scheme by extensive simulations
under practical configurations. We simulate a region consisting of three macro
BSs and three micro BSs in an area of 1.5 km x 1.5 km, as Fig. 5 shows. Moreover, we assume
that file transmission requests at location $x \in \mathcal{L}$ follow a Poisson point process with arrival rate
$\lambda(x)$ and file size $1/\mu(x) = 100$ kbyte. The maximum transmission powers are 20 W and 1 W
for macro and micro BSs, respectively. Based on the linear
power consumption relationship in [6], the maximum operational powers for macro BS and
micro BS are 865 W and 38 W, respectively. We set the other main parameters in the propagation
model according to the COST-231 modified Hata model [30], as summarized in Table I.

The proposed TACT algorithm is implemented with a discount factor $\gamma = 0.001$ and
temperature $\tau = 500$. Based on [14], the learning rates are $\alpha(n) = 1/n$ and $\beta(n) = 1/(n \log n)$.
Moreover, the transfer rate is $\zeta(n) = \theta^{n}$, with the transfer rate factor $\theta \in (0, 1)$, thus satisfying the
assumption in Theorem 1.
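
As a rough illustration of these schedules, the sketch below codes the three rates and the ratio $\zeta(n)/\beta(n)$, which the convergence proof in the Appendix requires to vanish as $n$ grows. The function names are hypothetical, and starting the actor schedule at $n = 2$ is an assumption made here because $1/(n \log n)$ is undefined at $n = 1$.

```python
import math


def critic_rate(n):
    """Learning rate alpha(n) = 1/n, for n >= 1."""
    return 1.0 / n


def actor_rate(n):
    """Learning rate beta(n) = 1/(n log n); assumed to start at n = 2."""
    if n < 2:
        raise ValueError("beta(n) = 1/(n log n) needs n >= 2")
    return 1.0 / (n * math.log(n))


def transfer_rate(n, theta):
    """Transfer rate zeta(n) = theta**n with theta in (0, 1)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in (0, 1)")
    return theta ** n


def transfer_to_actor_ratio(n, theta):
    """Ratio zeta(n)/beta(n), which should tend to zero as n grows."""
    return transfer_rate(n, theta) / actor_rate(n)
```

Since $\theta^{n}$ decays geometrically while $\beta(n)$ decays only polynomially (up to a logarithm), the ratio tends to zero for any $\theta \in (0,1)$, although for $\theta$ close to 1 it can grow for small $n$ before it decays.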

TABLE I
Used simulation parameters

Parameter description                           Value
Simulation area                                 1.5 km x 1.5 km
Maximum transmission power      Macro BS        20 W
                                Micro BS        1 W
Maximum operational power       Macro BS        865 W
                                Micro BS        38 W
Height                          Macro BS        32 m
                                Micro BS        12.5 m
Channel bandwidth                               1.25 MHz
Intra-cell interference factor                  0.01
File requests                   Arrival rate    5 x 10^-6 ~ 10^-4
                                File size       100 kbyte
Constant power percentage                       0.1 ~ 0.9

a For simplicity, we do not consider the fast fading effect and noise influence in our simulation.

We define the cumulative energy consumption ratio as the metric to test how much
energy saving can be achieved by our proposed scheme. Specifically, it is
the ratio between the accumulated energy consumption
when certain BSs are turned off (as our scheme runs) and that when all the BSs stay
active since the simulation starts. This definition is reasonable since it shows the foresighted
energy efficiency improvement, which is exactly the goal of an energy saving scheme. Besides,
we compare the performance of our proposed schemes, i.e., the classical actor-critic (AC) based and
the transfer actor-critic (TACT) based energy saving schemes, with that of the state-of-the-art
(SOTA) scheme, which assumes the controller can obtain full knowledge of traffic loads
a priori and find the optimal BS switching solution by exhaustively searching all the possible ones.
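
A minimal sketch of the cumulative energy consumption ratio is given below. It assumes equal-length time slots, so that per-slot power values can be accumulated directly as energy; the function name and the input format (two lists of per-slot power draws) are illustrative choices, not taken from the original simulator.

```python
from itertools import accumulate


def cumulative_energy_ratio(scheme_power, all_on_power):
    """Running ratio of the energy used under the switching scheme to the
    energy used when every BS stays active, one value per time slot.

    Both inputs list the total power drawn in each (equal-length) slot.
    """
    if len(scheme_power) != len(all_on_power):
        raise ValueError("both traces must cover the same time slots")
    scheme_energy = accumulate(scheme_power)    # energy so far with switching
    baseline_energy = accumulate(all_on_power)  # energy so far with all BSs on
    return [s / b for s, b in zip(scheme_energy, baseline_energy)]
```

For instance, with three macro and three micro BSs at their maximum operational powers, the all-on draw is 3 x 865 + 3 x 38 = 2709 W per slot; a hypothetical trace in which one macro BS sleeps from the second slot onward draws 1844 W in those slots, and the ratio then falls below 1 and keeps decreasing as more slots accumulate.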

A. Effect of traffic loads with static arrival rates on energy saving scheme 

We first examine how much energy saving can be achieved under different static traffic
load arrival rates. [6] shows that a homogeneous traffic distribution of $\lambda(x) = 10^{-4}$ for all $x \in \mathcal{L}$
offers a load corresponding to about 10% BS utilization when all BSs are turned on.
Therefore, we vary the homogeneous traffic arrival rate $\lambda(x)$ from $5 \times 10^{-6}$ to $10^{-4}$. Meanwhile,
to compute the traffic load boundary points r&, we record the average of traffic loads, i.e., T a , 
in the whole region and then compute T b for macro BSs and micro BSs by T bmacro = ^f, 

^b,micro Jq^ ^ b,macroi respectively. 

Fig. 6 shows the effect of traffic load on energy savings when the portion of fixed power
consumption $q_i$ equals 0.5. The transferred policy is generated from a source task
with the static arrival rate $\lambda = 5 \times 10^{-6}$. As the traffic arrival rate $\lambda$ decreases from
$10^{-4}$ to $5 \times 10^{-6}$, we can expect more significant energy conservation, since the BSs are highly
under-utilized if they all stay active under lower traffic loads. Moreover, the cumulative
energy consumption ratio keeps decreasing as the simulation runs, since the controller
gains a better understanding of the traffic loads and thereby learns which action has better
energy efficiency. Since the proposed learning schemes are performed without
a priori knowledge of traffic loads, their performance is inferior to that of the
SOTA scheme, especially at the beginning of the simulations. However, the gap
caused by the absent knowledge becomes smaller when the TACT scheme is applied with
the learned knowledge.

Fig. 7 presents the performance improvement of the TACT scheme over the classical AC scheme. As
expected, the TACT scheme yields a relatively large performance improvement, especially at the
beginning of each simulation. In other words, the TACT scheme contributes to a performance
jumpstart, or a faster convergence speed. Fig. 7 also depicts the similarity between the source task
and the target task, measured by Kullback-Leibler divergence. It shows that a smaller Kullback-
Leibler divergence between the source task and the target task leads to a more efficient transfer
effect. Besides, we plot the impact of the transfer rate factor $\theta$ in Fig. 8. Generally speaking,
as we expect, a larger $\theta$ results in a higher convergence rate and a lower energy consumption ratio.
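
The two quantities discussed for Fig. 7 can be sketched as follows. The performance improvement follows the paper's own definition: the energy consumption margin between the TACT scheme and the classical AC scheme, divided by the energy consumption of the classical AC scheme. The text does not state which distributions enter the Kullback-Leibler divergence, so the helper below simply assumes two discrete distributions over the same support; both function names are hypothetical.

```python
import math


def performance_improvement(energy_ac, energy_tact):
    """Relative energy margin of TACT over classical AC."""
    if energy_ac <= 0.0:
        raise ValueError("energy_ac must be positive")
    return (energy_ac - energy_tact) / energy_ac


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

A divergence of zero means the source and target distributions coincide, which is the case where the transferred policy should be most useful.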

B. Effect of energy consumption models of BSs on energy saving scheme 

In this part, we vary the portion of fixed power consumption $q_i$ between 0 and 1, in order to
cover various types of BSs with different energy consumption models. Fig. 9 shows the effect
of energy consumption models of BSs on energy saving schemes when the traffic file requests
follow a homogeneous distribution with arrival rate $\lambda(x)$ from $5 \times 10^{-6}$ to $10^{-4}$. In this case,
the transferred policy for a target task with a specific arrival rate $\lambda$ is the learning result from
a source task with the same arrival rate $\lambda$ and an energy consumption model $q_i = 0.5$. As Fig. 9
depicts, the schemes perform better when the constant power consumption accounts for
a larger proportion of the whole energy consumption. The reason is that when the constant
power consumption takes a larger percentage, i.e., $q_i = 0.9$, turning off one under-utilized BS
makes a clearer difference and saves more energy. On the other hand, more than half of
the overall energy consumption usually takes place on the constant power, i.e., cooling, idle-mode
signaling and processing in the present RAN infrastructure [7]. Therefore, our proposed
scheme can render a strong positive effect in saving energy. Fig. 9 also shows that the
TACT scheme outperforms the classical AC scheme in all the
energy consumption configurations.

4 The performance improvement is calculated by dividing the energy consumption margin between the TACT scheme and the classical
AC scheme by the energy consumption of the classical AC scheme.



C. Performance of learning framework-based energy saving scheme in periodic traffic load 
scenario 

In this section, we investigate the performance of the proposed scheme when traffic loads
fluctuate periodically. [12] shows that a practical traffic load profile is periodic and can be
approximated by a sinusoidal function $\lambda(t) = \lambda_V \cdot \cos(2\pi(t + \phi)/D) + \lambda_M$, where $t$ is the index of
time, $D$ is the period of the traffic load profile, $\lambda_V$ is the variance of the traffic profile and $\lambda_M$ is
the mean arrival rate. Therefore, we employ $\lambda(t, x) = (0.99 \cdot \cos(2\pi(t + 10)/24) + 1) \times 10^{-4}$
to approximate the practical traffic load arrival rate at location $x \in \mathcal{L}$. Fig. 10 compares the
performance of the proposed schemes and shows that the TACT scheme converges faster than
the classical AC scheme.
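
The periodic profile used in this scenario can be written as a small helper, sketched below. The default arguments reproduce the stated setting ($\lambda_V = 0.99 \times 10^{-4}$, $\lambda_M = 10^{-4}$, $\phi = 10$, $D = 24$); the function and parameter names are illustrative only.

```python
import math


def arrival_rate(t, lam_v=0.99e-4, lam_m=1.0e-4, phase=10.0, period=24.0):
    """Sinusoidal traffic profile lam_v * cos(2*pi*(t + phase)/period) + lam_m."""
    return lam_v * math.cos(2.0 * math.pi * (t + phase) / period) + lam_m
```

With these defaults the rate peaks at $t = 14$ (about $1.99 \times 10^{-4}$), reaches its trough at $t = 2$ (about $10^{-6}$), and repeats every 24 time indices, so it stays strictly positive over the whole period.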



VI. Conclusion 

In this paper, we have developed a learning framework for BS energy saving. We
specifically formulated the BS switching operation under varying traffic loads as a Markov decision
process. Besides, we adopted the actor-critic method, a reinforcement learning approach, to obtain
a BS switching solution that decreases the overall energy consumption. Afterwards, to fully exploit
the spatial and temporal relevancy in traffic loads, we proposed a transfer actor-critic algorithm
that improves the strategies by taking advantage of learned knowledge from neighboring regions
or historical periods. Our proposed algorithm provably converges given certain restrictions that
arise during the learning process, and the extensive simulation results demonstrate the effectiveness
and robustness of our energy saving schemes under various practical configurations.

Appendix

Proof of Theorem 1

Proof: Without loss of generality, assume that at stage $k$, the state is $s^{k}$ and the chosen action
is $a^{k}$. Moreover, the latest stage at which the state-action pair $(s^{k}, a^{k})$ occurred is stage $m$. Thus, by
Algorithm 1, the policy $p_o^{j}(s^{k}, a^{k})$ remains invariant for any $j \in [m, \cdots, k)$. For simplicity of
representation, we denote one sequence $p_o^{\bar{k}}(s^{k}, a^{k}) = p_o^{k}(s^{k}, a^{k})$ and $p_o^{\bar{k}-1}(s^{k}, a^{k}) = p_o^{j}(s^{k}, a^{k})$, for
any $j \in [m, \cdots, k)$, where the index $\bar{k}$ equals $\nu_2(s^{k}, a^{k}, k)$. In addition, the sequences $p_n^{\bar{k}}(s^{k}, a^{k})$
and $\delta^{\bar{k}}(s^{k}, a^{k})$ are defined analogously to $p_o^{\bar{k}}(s^{k}, a^{k})$. Thus, we have

$$
\begin{aligned}
p_o^{\bar{k}}(s^{k}, a^{k}) &= p_o^{k}(s^{k}, a^{k}) \\
&= \Big[\big(1 - \zeta(\nu_2(s^{k}, a^{k}, k) - 1)\big)\,p_n^{k}(s^{k}, a^{k}) + \zeta(\nu_2(s^{k}, a^{k}, k) - 1)\,p_e(s^{k}, a^{k})\Big]_{-p_t}^{p_t} \\
&= \Big[\big(1 - \zeta(\bar{k} - 1)\big)\,p_n^{\bar{k}}(s^{k}, a^{k}) + \zeta(\bar{k} - 1)\,p_e(s^{k}, a^{k})\Big]_{-p_t}^{p_t}. \qquad (16)
\end{aligned}
$$

Firstly, assume that $p_t$ is large enough such that $|p_o^{\bar{k}}(s^{k}, a^{k})| < p_t$ and $|p_o^{\bar{k}+1}(s^{k}, a^{k})| < p_t$,
while the assumption will be dropped later.
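
The overall-policy update in Equation (16) mixes the native and the transferred (exotic) preferences and then projects the result onto $[-p_t, p_t]$. A minimal sketch is given below; it assumes scalar preferences for a single state-action pair, and the function names are hypothetical.

```python
def project(value, bound):
    """Clip value onto the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def overall_policy(p_native, p_exotic, zeta, p_t):
    """Overall preference [(1 - zeta) * p_native + zeta * p_exotic], projected
    onto [-p_t, p_t], for one state-action pair.

    zeta is the transfer rate evaluated at the relevant visit count.
    """
    mixed = (1.0 - zeta) * p_native + zeta * p_exotic
    return project(mixed, p_t)
```

When the transfer rate has decayed to zero the update returns the (projected) native preference alone, which matches the intuition that transferred knowledge only matters early in learning.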



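Subtracting Equation (16) from Equation (10), we obtain

$$
\begin{aligned}
p_o^{\bar{k}+1}(s^{k}, a^{k}) - p_o^{\bar{k}}(s^{k}, a^{k})
&= \big(1 - \zeta(\bar{k} - 1)\big)\big(p_n^{\bar{k}+1}(s^{k}, a^{k}) - p_n^{\bar{k}}(s^{k}, a^{k})\big) - \big(\zeta(\bar{k}) - \zeta(\bar{k} - 1)\big)\big(p_n^{\bar{k}+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big) \\
&= -\beta(\bar{k})\big(1 - \zeta(\bar{k} - 1)\big)\,\delta^{\bar{k}}(s^{k}, a^{k}) - \big(\zeta(\bar{k}) - \zeta(\bar{k} - 1)\big)\big(p_n^{\bar{k}+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big). \qquad (17)
\end{aligned}
$$

The last equality holds because of Equation (9).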

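Define $t_0 = 0$ and $t_{\bar{k}} = \sum_{j=0}^{\bar{k}-1} \beta(j)$. For $t \ge 0$, let $\bar{k}(t)$ denote the unique value of $\bar{k}$ such
that $t_{\bar{k}} \le t < t_{\bar{k}+1}$, as Fig. 4-(a) depicts. For $t < 0$, set $\bar{k}(t) = 0$. Define the continuous time
interpolation $p^{0}_{s^{k},a^{k}}(\cdot)$ on $(-\infty, \infty)$ by $p^{0}_{s^{k},a^{k}}(t) = p_o^{0}(s^{k}, a^{k})$ for $t \le 0$, and for $t > 0$,

$$
p^{0}_{s^{k},a^{k}}(t) = p_o^{\bar{k}}(s^{k}, a^{k}), \quad \text{for } t_{\bar{k}} \le t < t_{\bar{k}+1}.
$$

Moreover, we define the sequence of shifted processes $p^{\bar{k}}_{s^{k},a^{k}}(t) = p^{0}_{s^{k},a^{k}}(t_{\bar{k}} + t)$, $t \in (-\infty, \infty)$,
as Fig. 4-(d) depicts. Define $Y_j = 0$ and $Z_j = 0$ for $j < 1$. Moreover, define $Y_j = (1 - \zeta(j - 1))\,\delta^{j}(s^{k}, a^{k})$
and $Z_j = (\zeta(j) - \zeta(j - 1))\big(p_n^{j+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big)$ for $j \ge 1$. Define $Z^{0}(t) = 0$
for $t \le 0$ and

$$
Z^{0}(t) = \sum_{j=0}^{\bar{k}(t)-1} Z_j, \qquad Z^{\bar{k}}(t) = Z^{0}(t_{\bar{k}} + t) - Z^{0}(t_{\bar{k}}) = \sum_{j=\bar{k}}^{\bar{k}(t_{\bar{k}}+t)-1} Z_j, \quad t \ge 0.
$$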
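
The time-scale construction used in this proof maps a continuous time $t$ to the index of the interval of accumulated step sizes that contains it. The sketch below illustrates that lookup under the assumption that the interpolation times are the partial sums of the actor step sizes, starting from zero; the function names and the list-based interface are illustrative, not part of the original proof.

```python
from bisect import bisect_right
from itertools import accumulate


def interpolation_times(step_sizes):
    """Partial sums [0, b0, b0 + b1, ...] of the given step sizes."""
    return [0.0] + list(accumulate(step_sizes))


def index_at(times, t):
    """Index k with times[k] <= t < times[k + 1]; returns 0 for t < 0."""
    if t < 0.0:
        return 0
    # bisect_right counts the entries <= t, so subtracting one gives k.
    return bisect_right(times, t) - 1
```

A piecewise-constant interpolation of any sequence can then be evaluated at time `t` by reading the sequence at `index_at(times, t)`.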

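Taking into account the definitions above (recall that $\bar{k}(t_{\bar{k}}) = \bar{k}$), the following equation can be
obtained by a manipulation of Equation (17):

$$
p^{\bar{k}}_{s^{k},a^{k}}(t) = p_o^{\bar{k}}(s^{k}, a^{k}) - \sum_{j=\bar{k}}^{\bar{k}(t_{\bar{k}}+t)-1} \big(\beta(j)\,Y_j + Z_j\big) = p_o^{\bar{k}}(s^{k}, a^{k}) - \sum_{j=\bar{k}}^{\bar{k}(t_{\bar{k}}+t)-1} \big(\beta(j)\,Y_j\big) - Z^{\bar{k}}(t). \qquad (18)
$$

Since $p^{\bar{k}}_{s^{k},a^{k}}(t)$ is piecewise constant, we can rewrite Equation (18) as

$$
p^{\bar{k}}_{s^{k},a^{k}}(t) = p_o^{\bar{k}}(s^{k}, a^{k}) - \int_{0}^{t} Y_{\bar{k}(t_{\bar{k}}+x)}\,dx - Z^{\bar{k}}(t) + \varphi^{\bar{k}}(t), \qquad (19)
$$

where $\varphi^{\bar{k}}(t)$ is the outcome due to the replacement of the first sum in Equation (18) by an integral.
$\varphi^{\bar{k}}(t) = 0$ at the times when the interpolated sequences have jumps, i.e., $t = t_{k'} - t_{\bar{k}}$, $k' > \bar{k}$,
and $\varphi^{\bar{k}}(t) \to 0$ in $t$ as $\bar{k} \to \infty$ under the assumption in Equation (13).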

Besides that, by our assumption that $\lim \zeta(k)/\beta(k) \to 0$ as $k \to \infty$, $Z_{\bar{k}} = (\zeta(\bar{k}) - \zeta(\bar{k} - 1))
\cdot \big(p_n^{\bar{k}+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big) = o(\beta(\bar{k}))\,\big(p_n^{\bar{k}+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big)$. Therefore,
$Z^{\bar{k}}(t) = \sum_{j=\bar{k}}^{\bar{k}(t_{\bar{k}}+t)-1} o(\beta(j))\,\big(p_n^{j+1}(s^{k}, a^{k}) - p_e(s^{k}, a^{k})\big)$. Thus, as $\bar{k} \to \infty$, $Z^{\bar{k}}(t)$ is negligible, since it is
of a smaller order of magnitude than the first sum in Equation (18).

Given the above discussion, as $\bar{k} \to \infty$, the sequence of functions $p^{\bar{k}}_{s^{k},a^{k}}(t) = p_o^{\bar{k}}(s^{k}, a^{k}) -
\int_{0}^{t} Y_{\bar{k}(t_{\bar{k}}+x)}\,dx$ is equicontinuous. Hence, by the Arzela-Ascoli Theorem [29], there is a convergent
subsequence in the sense of uniform convergence on each bounded time interval, and it is easily
seen that any limit of $p^{\bar{k}}_{s^{k},a^{k}}(t)$, or the discrete equivalent $p_o^{\bar{k}}(s^{k}, a^{k})$, must track the solution of
the ODE $\dot{p}_{s^{k},a^{k}}(t) = -\delta^{\bar{k}}(s^{k}, a^{k})$ for sufficiently large $\bar{k}$.

Next, in the special case where $p_o^{\bar{k}-1}(s^{k}, a^{k}) = p_t$ and $\delta^{\bar{k}-1}(s^{k}, a^{k}) > 0$, at the next stage $\bar{k}$, the
overall policy $p_o^{\bar{k}}(s^{k}, a^{k})$ would equal $p_t$. Thus, the ODE is $\dot{p}_{s^{k},a^{k}}(t) = 0$. A similar discussion can
be easily applied to the case where $p_o^{\bar{k}-1}(s^{k}, a^{k}) = -p_t$ and $\delta^{\bar{k}-1}(s^{k}, a^{k}) < 0$.

Furthermore, as $k \to \infty$, by Lemma 1, $\bar{k} = \nu_2(s, a, k) \to \infty$.
Summarizing the above discussion and taking into account $\delta(s^{k}, a^{k}) = \lim \delta^{\bar{k}}(s^{k}, a^{k})$ as $\bar{k} \to
\infty$, we can obtain

Fig. 1. Transfer learning for reinforcement learning in BS switching operation scenario.

Fig. 2. Architecture of classical actor-critic algorithm and transfer actor-critic algorithm (TACT).

Fig. 3. Illustration of actor-critic learning framework for energy saving scheme.

Fig. 4. Illustration of (a) the function $\bar{k}(t)$, (b) the function $p^{0}_{s^{k},a^{k}}(t)$, (c) the function $\bar{k}(t_{\bar{k}} + t)$ and (d) the function $p^{\bar{k}}_{s^{k},a^{k}}(t)$.

Fig. 5. Illustration of BS deployment in our simulation scenario.

Fig. 6. Performance comparison among classical AC scheme, TACT scheme and SOTA scheme under various homogeneous
traffic arrival rates when the transfer rate factor $\theta = 0.1$.

Fig. 7. Performance improvement of TACT scheme over classical AC scheme versus Kullback-Leibler divergence when the
transfer rate factor $\theta = 0.1$.

Fig. 8. Performance impact of the transfer rate factor $\theta$ to the TACT scheme when λ = 5 x 10

Fig. 9. Performance comparison among classical AC scheme, TACT scheme, SOTA scheme under different energy consumption
models after 1500 iterations when the transfer rate factor $\theta = 0.1$.

Fig. 10. Performance comparison of classical AC scheme, TACT scheme, SOTA scheme with time-variant traffic arrival rate
$\lambda(t, x) = (0.99 \cdot \cos(2\pi(t + 10)/24) + 1) \times 10^{-4}$ when $\theta = 0.1$.